
Conversation

@AdrianLundell (Collaborator) commented Nov 3, 2025

Reuses the FoldAndAnnotateQParamsPass from the Arm backend to greatly simplify the logic for fusing the ops.

Additionally updates the linear kernel to be numerically correct and computes the kernel_sum ahead of time (AOT) in the quantized_linear_fusion pass. Note that since this replaces the bias node, it typically causes no extra memory usage.
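
The bias-folding trick can be sketched as follows (a minimal numpy sketch, assuming symmetric int8 weights with zero-point 0; the function and variable names are illustrative, not the actual pass code):

```python
import numpy as np

def fold_kernel_sum_into_bias(weight_q: np.ndarray, bias: np.ndarray, input_zp: int) -> np.ndarray:
    # For s8 linear: acc_j = sum_i (x_i - x_zp) * w_ji + bias_j
    #              = sum_i x_i * w_ji + (bias_j - x_zp * sum_i w_ji)
    # so the x_zp * kernel_sum term can be computed AOT and folded into
    # the bias, which is why replacing the bias node adds no extra memory.
    kernel_sum = weight_q.astype(np.int32).sum(axis=1)  # per output channel
    return bias.astype(np.int32) - input_zp * kernel_sum

# The runtime kernel then only needs the raw integer matmul:
# acc = x_q @ w_q.T + folded_bias
```

With the fold applied, the runtime no longer needs a scratch buffer to hold the kernel sums, which is the motivation for replacing the bias node in the fusion pass.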

Updates the Linear tests to mirror this, including removing the various matmul tests. Since linear is now handled as a separate op rather than as a particular type of matmul, those tests are no longer relevant.

Removes unnecessary stub definitions in operators.py, operators.yaml and op_quantized_linear.cpp

Leaving a few TODOs since the patch is already large.

cc @freddan80 @per @zingo @oscarandersson8218 @digantdesai


Signed-off-by: Adrian Lundell <[email protected]>
Change-Id: I194228ee3ae4b64a92f3f818afb2e045cc3acf91
@AdrianLundell AdrianLundell requested a review from psiddh November 3, 2025 15:59
@AdrianLundell AdrianLundell added partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm release notes: none Do not include this in the release notes labels Nov 3, 2025
pytorch-bot bot commented Nov 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15526

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Cancelled Job, 4 Unrelated Failures

As of commit dd7c05e with merge base d07a49a (image):

NEW FAILURE - The following job has failed:

CANCELLED JOB - The following job was cancelled. Please retry:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 3, 2025
@psiddh (Contributor) commented Nov 4, 2025

@AdrianLundell Just want to better understand the rationale behind removing the per-channel support / scratch buffer code and, in general, the stateful-ops related code? (I do understand the FoldDQQ-related changes to the pass, though.)

@AdrianLundell (Collaborator, Author):

The implementation didn't produce numerically correct results originally, so I wanted to fix that before adding per-channel support. Not doing so in this patch was mostly to keep the patch size down and to prioritize the most important functionality before going into details.

Regarding the stateful scratch buffer: since we compute the kernel sum (which the buffer was used for) AOT and replace the bias, there is no reason to do that at runtime as I see it. The only use-case where that would make sense (using the MVE implementation) is if you are very tight on memory and prefer to spend some extra time during each inference to compute the kernel sum in a scratch buffer rather than keeping it in memory. But then again, there is no reason to do everything in one single patch; better to get a good minimal working example running IMO.
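
For context, the int8 linear computation under discussion looks roughly like the following (an illustrative numpy reference, not the actual CMSIS-NN kernel; names are made up for the sketch, and weights are assumed symmetric with zero-point 0):

```python
import numpy as np

def quantized_linear_ref(x_q, w_q, bias, x_zp, out_zp, effective_scale):
    # int32 accumulation with the input zero-point subtracted;
    # the x_zp term is what the AOT kernel-sum fold removes from the runtime
    acc = (x_q.astype(np.int32) - x_zp) @ w_q.astype(np.int32).T + bias
    # requantize the accumulator down to int8 output
    out = np.round(acc * effective_scale) + out_zp
    return np.clip(out, -128, 127).astype(np.int8)
```

A real kernel would use a fixed-point multiplier and shift instead of a float `effective_scale`, but the structure of the computation is the same.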

@psiddh (Contributor) commented Nov 4, 2025

> The implementation didn't produce numerically correct results originally, so I wanted to fix that before adding per-channel support. Not doing so in this patch was mostly to keep the patch size down and to prioritize the most important functionality before going into details.
>
> Regarding the stateful scratch buffer: since we compute the kernel sum (which the buffer was used for) AOT and replace the bias, there is no reason to do that at runtime as I see it. The only use-case where that would make sense (using the MVE implementation) is if you are very tight on memory and prefer to spend some extra time during each inference to compute the kernel sum in a scratch buffer rather than keeping it in memory. But then again, there is no reason to do everything in one single patch; better to get a good minimal working example running IMO.

  • Yes, we need a scratch buffer for the DSP & MVE (Helium) implementations, I think
  • Given that, I would suggest keeping the CMSISScratchBufferContext file as-is for now. The idea was to evolve this into a generic ScratchBuffer impl that could be reused for a couple of different optimization scenarios
  • How are you testing these changes? E2E on FVP or on some real device for numerical results? Can you share the steps, please?
  • Can we add regression tests specifically for the numerical correctness fixes mentioned?

@psiddh (Contributor) commented Nov 5, 2025

Also, can you add comments in the code explaining why the scratch buffer context is simplified and when it might still be needed? Overall I think this is a solid refactor in the right direction.

@AdrianLundell (Collaborator, Author):

* Yes, we need scratch buffer for DSP & MVE (Helium) implementations, I think

We will need a scratch buffer for some ops, yes, but I am not sure the header solution for keeping track of the state will ever be needed. As I see it, we either do the computation AOT, as in this case, or it is reset with every inference, as for CONV. Maybe I am missing some use-case here?

* Given that, I would suggest to keep the CMSISScratchBufferContext file as-is, for now. The idea was to evolve this into a generic ScratchBuffer impl that could be reused for a couple of different optimization scenarios

Agree we can keep it for potential future use.

* How are you testing these changes? E2E on FVP or some real device for numerical results? Can you share the steps please?

The Python op implementation is tested in test_dialect_linear in backends/cortex_m/test/ops/test_linear.py, and the C++ implementation is tested on FVP in the test_implementation_linear tests. The steps for running these after downloading ExecuTorch are:

python3 -m venv env
source env/bin/activate
pip install -U pip setuptools wheel cmake pytest-cov zstd ninja
./install_executorch.sh --editable
examples/arm/setup.sh --i-agree-to-the-contained-eula
source examples/arm/ethos-u-scratch/setup_path.sh

# To test installation
examples/arm/run.sh

# To build cortex-m test runner
backends/cortex_m/test/build_test_runner.sh

# To run linear tests:
pytest --config-file=backends/arm/test/pytest.ini backends/cortex_m/test/ops/test_linear.py

* Can we add regression tests specifically for the numerical correctness fixes mentioned?

These are available in test_linear.py. These tests were not passing before the patch.
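
As a rough illustration of what such a numerical-correctness check does (a hypothetical sketch with made-up names; the real tests live in backends/cortex_m/test/ops/test_linear.py and exercise the actual kernels):

```python
import numpy as np

def dequantize(q, scale, zp):
    # map int8 values back to float for the reference computation
    return (q.astype(np.float32) - zp) * scale

def check_quantized_linear_numerics():
    rng = np.random.default_rng(0)
    x_q = rng.integers(-128, 128, size=(4, 8), dtype=np.int8)
    w_q = rng.integers(-127, 128, size=(16, 8), dtype=np.int8)
    x_scale, x_zp = 0.02, 3
    w_scale = 0.01  # symmetric weights, zero-point 0
    # int32 accumulation, as the quantized kernel would do it
    acc = (x_q.astype(np.int32) - x_zp) @ w_q.astype(np.int32).T
    # float reference computed on dequantized values
    ref = dequantize(x_q, x_scale, x_zp) @ dequantize(w_q, w_scale, 0).T
    np.testing.assert_allclose(acc * (x_scale * w_scale), ref, rtol=1e-4, atol=1e-4)
```

The idea is simply that the fixed-point accumulation, once rescaled, must agree with the float reference to within quantization error; a kernel with the original numerical bug would fail such a comparison.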

Signed-off-by: Adrian Lundell <[email protected]>
@AdrianLundell (Collaborator, Author):

Also can you add comments in the code explaining why scratch buffer context is simplified and when it might still be needed ? Overall I think this is a solid refactor in the right direction

I think the logic is well explained in the commit message + the quantized_linear_fusion_pass.py

@psiddh (Contributor) commented Nov 5, 2025

Thanks, looks good overall; approving as we also need to move faster.
Also, can you check the FVP CI failures (not related?)

[EDIT:] Looks like there are compilation issues / CI failures (not sure if it is related to #15612)

Fix a merge issue causing the build to fail + update tests after
merging of pytorch#15590

Signed-off-by: Adrian Lundell <[email protected]>
@AdrianLundell (Collaborator, Author):

The failures are in XNNPACK/ios-arm64-coreml-fp16 and are not related.

@AdrianLundell AdrianLundell merged commit 9843222 into pytorch:main Nov 6, 2025
136 of 142 checks passed
abhinaykukkadapu pushed a commit to abhinaykukkadapu/executorch that referenced this pull request Nov 6, 2025
metascroy added a commit that referenced this pull request Nov 6, 2025
@AdrianLundell AdrianLundell deleted the change-1120218 branch November 7, 2025 12:24